

Learning to Compose Visual Relations

Neural Information Processing Systems

[Figure: top-1 image-text retrieval results on iGibson scenes for relational queries (e.g., "a large brown metal cube below a large green rubber cylinder"), comparing CLIP, fine-tuned CLIP, and the proposed method.]


Learning to Reason with Mixture of Tokens

Jain, Adit, Rappazzo, Brendan

arXiv.org Artificial Intelligence

Reinforcement learning with verifiable rewards (RLVR) has become a leading approach for improving large language model (LLM) reasoning capabilities. Most current methods follow variants of Group Relative Policy Optimization, which samples multiple reasoning completions, scores them relative to each other, and adjusts the policy accordingly. However, these approaches invariably sample discrete tokens at each reasoning step, discarding the rich distributional information in the model's probability distribution over candidate tokens. While preserving and utilizing this distributional information has proven beneficial in non-RL settings, current RLVR methods seem to be unnecessarily constraining the reasoning search space by not using this information. To address this limitation, we investigate mixture-of-token generation (MoT-G) in RLVR. We present a unified framework that generalizes existing MoT-G approaches, including training-free methods that construct mixture embeddings as weighted sums over token embeddings, and extends RLVR to operate directly in this continuous mixture space for generating chain-of-thought. Evaluating two MoT-G variants on Reasoning-Gym, a suite of reasoning-intensive language tasks, we find that MoT-G methods achieve substantial improvements (5-35% gains on 7 out of 10 tasks) compared to standard decoding with the Qwen2.5-1.5B model, while reaching comparable accuracy with half the number of trajectories, suggesting improved training efficiency. Through comprehensive hidden-state and token-level analyses, we provide evidence that MoT-G's benefits may stem from its ability to maintain higher hidden-state entropy throughout the reasoning process and promote exploration in token space.
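The core idea, a mixture embedding formed as a probability-weighted sum over token embeddings, can be sketched in a few lines of NumPy. The embedding table, vocabulary size, and logits below are toy stand-ins, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16

# Toy embedding table and output-head logits standing in for an LLM's.
embedding_table = rng.normal(size=(vocab_size, d_model))
logits = rng.normal(size=vocab_size)

# Softmax over the vocabulary at one reasoning step.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Standard decoding collapses the distribution to a single discrete token.
hard_embedding = embedding_table[np.argmax(probs)]

# Mixture-of-token generation instead feeds back the probability-weighted
# sum over ALL token embeddings, preserving distributional information.
mixture_embedding = probs @ embedding_table  # shape: (d_model,)
```

Because the mixture is a convex combination of token embeddings, it stays inside the convex hull of the embedding table while remaining a fully differentiable function of the model's distribution.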




LLMs Can Plan Only If We Tell Them

Sel, Bilgehan, Jia, Ruoxi, Jin, Ming

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated significant capabilities in natural language processing and reasoning, yet their effectiveness in autonomous planning has been under debate. While existing studies have utilized LLMs with external feedback mechanisms or in controlled environments for planning, these approaches often involve substantial computational and development resources due to the requirement for careful design and iterative backprompting. Moreover, even the most advanced LLMs, such as GPT-4, struggle to match human performance on standard planning benchmarks such as Blocksworld without additional support. This paper investigates whether LLMs can independently generate long-horizon plans that rival human baselines. Our novel enhancements to Algorithm-of-Thoughts (AoT), which we dub AoT+, achieve state-of-the-art results on planning benchmarks, out-competing prior methods and human baselines, all autonomously.


MEWL: Few-shot multimodal word learning with referential uncertainty

Jiang, Guangyuan, Xu, Manjie, Xin, Shiji, Liang, Wei, Peng, Yujia, Zhang, Chi, Zhu, Yixin

arXiv.org Artificial Intelligence

Without explicit feedback, humans can rapidly learn the meaning of words. Children can acquire a new word after just a few passive exposures, a process known as fast mapping. This word learning capability is believed to be the most fundamental building block of multimodal understanding and reasoning. Despite recent advancements in multimodal learning, a systematic and rigorous evaluation is still missing for human-like word learning in machines. To fill this gap, we introduce the MachinE Word Learning (MEWL) benchmark to assess how machines learn word meaning in grounded visual scenes. MEWL covers humans' core cognitive toolkit in word learning: cross-situational reasoning, bootstrapping, and pragmatic learning. Specifically, MEWL is a few-shot benchmark suite consisting of nine tasks for probing various word learning capabilities. These tasks are carefully designed to align with children's core abilities in word learning and echo theories in the developmental literature. By evaluating multimodal and unimodal agents' performance with a comparative analysis of human performance, we observe a sharp divergence between human and machine word learning. We further discuss these differences between humans and machines and call for human-like few-shot word learning in machines.


Multiset-Equivariant Set Prediction with Approximate Implicit Differentiation

Zhang, Yan, Zhang, David W., Lacoste-Julien, Simon, Burghouts, Gertjan J., Snoek, Cees G. M.

arXiv.org Machine Learning

Most set prediction models in deep learning use set-equivariant operations, but they actually operate on multisets. We show that set-equivariant functions cannot represent certain functions on multisets, so we introduce the more appropriate notion of multiset-equivariance. We identify that the existing Deep Set Prediction Network (DSPN) can be multiset-equivariant without being hindered by set-equivariance and improve it with approximate implicit differentiation, allowing for better optimization while being faster and saving memory. In a range of toy experiments, we show that the perspective of multiset-equivariance is beneficial and that our changes to DSPN achieve better results in most cases. On CLEVR object property prediction, we substantially improve over the state-of-the-art Slot Attention from 8% to 77% in one of the strictest evaluation metrics because of the benefits made possible by implicit differentiation.
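The representational gap between set- and multiset-equivariance can be seen with a toy Deep Sets style layer: a permutation-equivariant map built from a shared per-element transform plus a symmetric pooled summary must send equal inputs to equal outputs, so it can never assign distinct targets to duplicated elements. The layer below is an illustrative stand-in, not DSPN itself:

```python
import numpy as np

def set_equivariant_layer(x):
    # Shared elementwise transform plus a permutation-invariant pooled
    # summary: a Deep Sets style layer of the form f(x_i) + g(mean(x)).
    return np.tanh(x) + x.mean(axis=0)

# A multiset with an exact duplicate: the two copies of 1.0 necessarily
# receive identical outputs under any set-equivariant map.
x = np.array([[1.0], [1.0], [3.0]])
y = set_equivariant_layer(x)
assert np.allclose(y[0], y[1])
```

A task that requires separating the duplicates (for example, predicting two overlapping objects at slightly different positions) is therefore unrepresentable by such a layer, which is the motivation for the weaker multiset-equivariance property.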


Learning to Compose Visual Relations

Liu, Nan, Li, Shuang, Du, Yilun, Tenenbaum, Joshua B., Torralba, Antonio

arXiv.org Artificial Intelligence

The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only the description of the underlying objects and their associated relations. While there has been significant work on designing deep neural networks which may compose individual objects together, less work has been done on composing the individual relations between objects. A principal difficulty is that while the placement of objects is mutually independent, their relations are entangled and dependent on each other. To circumvent this issue, existing works primarily compose relations by utilizing a holistic encoder, in the form of text or graphs. In this work, we instead propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully. We further show that decomposition enables our model to effectively understand the underlying relational scene structure.
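The factorized composition can be sketched as summing the energies of independent relation models: a scene satisfying every relation has low total energy. The hand-written energy functions and (x, y) scene layout below are illustrative assumptions, not the paper's learned energy-based models:

```python
import numpy as np

def energy_above(scene):
    # Low (zero) energy when object 0 sits above object 1.
    # Rows of `scene` are objects; columns are (x, y) positions.
    return max(0.0, scene[1, 1] - scene[0, 1])

def energy_left_of(scene):
    # Low (zero) energy when object 0 is to the left of object 2.
    return max(0.0, scene[0, 0] - scene[2, 0])

def composed_energy(scene, relations):
    # Factorized composition: the energy of a conjunction of relations
    # is the sum of the individual relations' energies.
    return sum(E(scene) for E in relations)

scene = np.array([[0.0, 2.0],   # object 0 at (0, 2)
                  [0.5, 1.0],   # object 1 at (0.5, 1)
                  [2.0, 0.0]])  # object 2 at (2, 0)
total = composed_energy(scene, [energy_above, energy_left_of])
```

Since each relation is an unnormalized density, adding energies corresponds to multiplying the densities, so independently trained relations compose without any holistic text or graph encoder.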


Deep Set Prediction Networks

Zhang, Yan, Hare, Jonathon, Prügel-Bennett, Adam

arXiv.org Machine Learning

We study the problem of predicting a set from a feature vector with a deep neural network. Existing approaches ignore the set structure of the problem and suffer from discontinuity issues as a result. We propose a general model for predicting sets that properly respects the structure of sets and avoids this problem. With a single feature vector as input, we show that our model is able to auto-encode point sets, predict bounding boxes of the set of objects in an image, and predict the attributes of these objects in an image.


This advanced neural network can explain its thought process (finally)

#artificialintelligence

The type of artificial intelligence known as a neural network can be trained to complete tasks once thought to be exclusive to humans, such as driving a car, creating visual art, or composing a heavy metal album. But neural networks have a big problem: they're really complicated. They're so complex, in fact, that researchers have often struggled to explain precisely why they make specific decisions. Now, researchers at the Massachusetts Institute of Technology (MIT) say they've created a neural network that can explain the steps it took to solve a problem -- an advance that could help us better understand how the technology works and alleviate safety concerns in riskier applications, like self-driving cars. The new algorithm, called the Transparency by Design Network (TbD-net), breaks down the process of recognizing an image into subtasks.